Abstract

The Fetch-and-add (F&A) operation has been used effectively in a number of
process coordination algorithms. In this paper we assess the power of
Fetch-and-increment (F&I) and Fetch-and-decrement (F&D), which we view as
restricted forms of F&A in which the only addends permitted are ±1. F&A-based
algorithms that use only unit addends are thus trivially expressed with just
F&I and F&D. Our primary contribution is a bottleneck-free readers/writers
algorithm that uses only F&I and F&D. We also indicate how existing F&A-based
algorithms for queues-with-multiplicity can be simplified to an algorithm
using just F&I and F&D. Finally, we discuss a general technique for replacing
F&A by F&I/F&D at a cost logarithmic in the number of processors.


1.    Introduction 

The Fetch-and-add instruction, introduced in [GK81], has been used in a number
of process coordination algorithms (see, for example, [GLR83] [Wil88]
[r'S'89]). Fetch-and-add is essentially an indivisible add to memory; its
format is FAA(V,e), where V is an integer variable and e is an integer
expression. The operation is defined to return the (old) value of V while
atomically replacing V by the sum V + e. We refer to V as the target and e as
the addend of the fetch-and-add. Important special cases occur when the addend
is ±1, which are referred to as fetch-and-increment and fetch-and-decrement
(F&I and F&D). For example, if multiple processes execute F&I(V) (i.e.,
F&A(V,1)) concurrently, the values returned are consecutive integers. These
values can then be used as indices into an array with the assurance that each
array element is assigned to exactly one process. Queue algorithms have been
developed using this observation [GLR83].
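The index-assignment observation above can be illustrated with a minimal Python sketch. Python has no hardware fetch-and-add, so the `SharedInt` class below is a hypothetical stand-in that makes the operation indivisible with a lock; `claim_slots` is likewise an illustrative name, not something taken from [GLR83].

```python
import threading


class SharedInt:
    """Integer cell whose fetch_and_add is made indivisible with a lock.

    This is an assumed stand-in for a hardware F&A, not a real primitive.
    """

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()

    def fetch_and_add(self, e):
        # Return the old value of V while atomically replacing V by V + e.
        with self._guard:
            old = self._value
            self._value = old + e
            return old

    def fetch_and_increment(self):
        return self.fetch_and_add(1)

    def fetch_and_decrement(self):
        return self.fetch_and_add(-1)

    def read(self):
        with self._guard:
            return self._value


def claim_slots(num_processes):
    """Each thread performs one F&I and writes its id into the slot it drew."""
    index = SharedInt(0)
    slots = [None] * num_processes

    def worker(pid):
        slots[index.fetch_and_increment()] = pid

    threads = [threading.Thread(target=worker, args=(pid,))
               for pid in range(num_processes)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return slots, index.read()
```

Because the returned values are consecutive integers, every slot ends up holding exactly one process id, whatever order the threads happen to run in.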

Many of the F&A-based algorithms are bottleneck-free, i.e., the computational
complexity is independent of the number of processors when the algorithm is
executed on a CRCW PRAM [Sni82]¹. For example, P processors can each insert an
item onto a single queue in the time required for just one insertion.

The present paper studies the following question: Given an F&A-based
algorithm, does there exist an equally efficient algorithm using only F&I and
F&D? For algorithms in which the only addends used are ±1, such as the queue
algorithms suggested above, the answer is trivially yes. One reason for our
interest in this question is that hardware realizations of F&I and F&D are
somewhat easier than for F&A. For example, an effective technique for
combining concurrent F&I's on single-bus multiprocessors is discussed in
[GSS89]. (This same paper also notes that the queue algorithms mentioned above
use only unit addends.)

¹ For our purposes a CRCW PRAM may be considered an idealized multiprocessor in
which a memory reference requires but a single cycle. Concurrent references to
the same location are permitted, all complete in one cycle, and yield the same
results as if executed in some (arbitrarily chosen) serial order.

The readers/writers problem of Courtois, Heymans, and Parnas [CHP71] has led
to a bottleneck-free algorithm in which readers execute F&A with unit addends
but writers use addends ±K, for a large constant K (the idea is that readers
need one unit of a resource, writers need K units, and exactly K units are
available) [GLR83]². This report offers a new bottleneck-free readers/writers
algorithm (as well as modifications to other F&A-based coordination
algorithms) using only F&I and F&D.

We also consider the general problem of implementing F&A in terms of F&I and
F&D and discuss a solution using "software combining" [YTL86] which simulates
F&A with a time and space complexity log(P) for P processors.

2.    Bottleneck-free  F&I  algorithm  for  readers  and  writers 

The readers and writers synchronization paradigm, described in [CHP71],
coordinates access of two classes of processes to a protected resource.
Processes in the first class, writers, require exclusive access, whereas
processes in the second class, readers, may share the resource with other
readers. One often imposes additional fairness conditions limiting the time a
process may need to wait for access. The most likely cause of significant
waiting is that readers are permitted to begin while other readers are active,
and thus a continual stream of readers may starve all writers. We eliminate
this possibility by the standard technique of giving writers priority, which
naturally does not prevent writers from starving readers. In addition, our
implementation (like the F&A implementation in [GLR83]) uses an unfair
semaphore that allows writers to starve other writers. These potential
starvations have not proved troublesome in practice. (There are
reader-priority variations of both the F&A and F&I implementations. Fair
algorithms are also known for both F&A and F&I.)

Our algorithm (see Figure 1) uses a synchronization technique inspired by
[Fdl84] and employs two shared variables, NumReaders and NumWriters, counting
the number of active readers and writers respectively. A proof of correctness
will appear in a subsequent report; here we give an informal argument.

The Writer procedure is essentially

    P(NumWriters); Wait for readers to exit; Write; V(NumWriters)

The (binary) semaphore prevents two processes from executing Write
concurrently. The semaphore algorithm used, which has been taken from [GLR83],
ensures that NumWriters is positive whenever a process is granted the
semaphore. This property is used by the Reader procedure. Similarly,
NumReaders is positive if there are any active readers, and the writers use
this property to wait for readers to exit. Mutual exclusion of readers and
writers is enforced by the order of operations: a writer increments NumWriters
before it reads NumReaders, and a reader increments NumReaders before it reads
NumWriters.

² We consider this solution to be bottleneck-free because an arbitrary number
of readers can execute in constant time. Writers are serialized; however, this
is required by the specification of the problem.

Shared variables:

    NumReaders := 0, NumWriters := 0: integer;

procedure Reader is
var
    Granted: boolean;  -- private
begin
    Granted = False;
    repeat
        while (NumWriters > 0)  -- wait for writer to exit
            skip;
        f&i(NumReaders);  -- try to get lock
        if (NumWriters == 0)  -- did a writer beat me into the lock?
            Granted = True;  -- no, I have the lock!
        else
            f&d(NumReaders);  -- yes; undo increment, wait, and try again
        endif
    until (Granted);
    -- Perform reader action
    f&d(NumReaders);  -- release lock
end

procedure Writer is
var
    AmFirst: boolean;  -- private
begin
    AmFirst = False;
    repeat
        while (NumWriters > 0)  -- only one writer at a time
            skip;
        if (f&i(NumWriters) == 0)  -- increment NumWriters, am I first?
            AmFirst = True;
        else
            f&d(NumWriters);  -- no, undo increment, wait, and try again
        endif
    until (AmFirst);
    while (NumReaders != 0)  -- wait for readers to exit
        skip;
    -- Perform writer action
    f&d(NumWriters);  -- release the lock
end

Figure 1. Readers and Writers algorithm.

It remains to show that the algorithm is free of deadlocks and that readers
cannot starve writers. It is easy to see that if no writers are competing for
the resource, NumWriters = 0 and hence readers are free to enter. Similarly,
if no readers are competing, NumReaders = 0 and one writer is free to enter.
Thus it suffices to show that, if both readers and writers are competing, then
within finite time a writer will reach the write command. In order to show
this, we first note that writers do not decrement NumWriters if they detect a
competing reader. By keeping NumWriters > 0, the writers force both new and
currently competing readers to remain at the empty while loop. Once all
competing readers are in this loop, NumReaders = 0 and (exactly) one writer is
free to reach the write command.
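As an informal check of the argument above, the Reader and Writer procedures of Figure 1 can be transliterated into Python. This is only a sketch under stated assumptions: `SharedInt` is a hypothetical helper that simulates indivisible F&I/F&D with a lock, the busy-wait loops call `time.sleep(0)` to yield, and `run_demo` is an illustrative harness that counts any exclusion violations it observes.

```python
import threading
import time


class SharedInt:
    """Counter with indivisible F&I / F&D, simulated with a lock (assumed helper)."""

    def __init__(self):
        self._value = 0
        self._guard = threading.Lock()

    def _add(self, delta):
        with self._guard:
            old = self._value
            self._value = old + delta
            return old

    def fetch_and_increment(self):
        return self._add(1)

    def fetch_and_decrement(self):
        return self._add(-1)

    def read(self):
        with self._guard:
            return self._value


num_readers = SharedInt()
num_writers = SharedInt()


def reader(action):
    granted = False
    while not granted:
        while num_writers.read() > 0:          # wait for writer to exit
            time.sleep(0)                      # yield while spinning
        num_readers.fetch_and_increment()      # try to get lock
        if num_writers.read() == 0:            # did a writer beat me in?
            granted = True                     # no, I have the lock
        else:
            num_readers.fetch_and_decrement()  # yes: undo increment and retry
    action()
    num_readers.fetch_and_decrement()          # release lock


def writer(action):
    am_first = False
    while not am_first:
        while num_writers.read() > 0:          # only one writer at a time
            time.sleep(0)
        if num_writers.fetch_and_increment() == 0:  # am I first?
            am_first = True
        else:
            num_writers.fetch_and_decrement()  # no: undo increment and retry
    while num_readers.read() != 0:             # wait for readers to exit
        time.sleep(0)
    action()
    num_writers.fetch_and_decrement()          # release the lock


def run_demo(readers=3, writers=2, rounds=20):
    """Illustrative harness: returns (exclusion violations, completed writes)."""
    state = {"r": 0, "w": 0, "violations": 0, "writes": 0}
    guard = threading.Lock()

    def act(kind):
        with guard:
            state[kind] += 1
            if state["w"] > 1 or (state["w"] and state["r"]):
                state["violations"] += 1
        time.sleep(0)                          # let other threads try to intrude
        with guard:
            state[kind] -= 1
            state["writes"] += kind == "w"

    def loop(procedure, kind):
        for _ in range(rounds):
            procedure(lambda: act(kind))

    threads = [threading.Thread(target=loop, args=(reader, "r"))
               for _ in range(readers)]
    threads += [threading.Thread(target=loop, args=(writer, "w"))
                for _ in range(writers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["violations"], state["writes"]
```

A run that reports zero violations is consistent with the mutual exclusion argument but does not prove it; the harness only samples the interleavings the scheduler happens to produce.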

3.  Semaphores 

The F&A-based P and V algorithms of [GLR83] use only unit addends and hence
are trivially expressible with F&I/F&D. These algorithms include both binary
(i.e., ordinary) and generalized semaphores (the latter structures permit up
to a prespecified M processes to obtain a resource; binary semaphores are the
special case M = 1). In addition, [GLR83] discusses PChunk/VChunk, where
multiple units of the resource can be seized/released atomically. These
algorithms appear to require F&A with non-unit addends and hence the best we
can achieve with F&I/F&D is the logarithmic-cost simulation of F&A discussed
in the next section.
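To make the unit-addend claim concrete, here is a minimal Python sketch of a generalized semaphore admitting up to M holders. It is not the [GLR83] code: it simply extends the increment-test-undo pattern of the Writer entry in Figure 1 from 1 to M. `SharedInt`, `GeneralizedSemaphore`, and `run_demo` are hypothetical names, and F&I/F&D are simulated with a lock.

```python
import threading
import time


class SharedInt:
    """Counter with indivisible F&I / F&D, simulated with a lock (assumed helper)."""

    def __init__(self):
        self._value = 0
        self._guard = threading.Lock()

    def _add(self, delta):
        with self._guard:
            old = self._value
            self._value = old + delta
            return old

    def fetch_and_increment(self):
        return self._add(1)

    def fetch_and_decrement(self):
        return self._add(-1)

    def read(self):
        with self._guard:
            return self._value


class GeneralizedSemaphore:
    """Admits at most m concurrent holders using only unit addends."""

    def __init__(self, m):
        self.m = m
        self.holders = SharedInt()   # holders plus transient attempts

    def p(self):
        while True:
            while self.holders.read() >= self.m:   # wait for a free unit
                time.sleep(0)
            if self.holders.fetch_and_increment() < self.m:
                return                             # fewer than m were ahead of us
            self.holders.fetch_and_decrement()     # too many: undo and retry

    def v(self):
        self.holders.fetch_and_decrement()


def run_demo(num_threads=6, m=2, rounds=20):
    """Illustrative harness: returns (peak concurrent holders, final count)."""
    sem = GeneralizedSemaphore(m)
    stats = {"inside": 0, "peak": 0}
    guard = threading.Lock()

    def worker():
        for _ in range(rounds):
            sem.p()
            with guard:
                stats["inside"] += 1
                stats["peak"] = max(stats["peak"], stats["inside"])
            time.sleep(0)
            with guard:
                stats["inside"] -= 1
            sem.v()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stats["peak"], sem.holders.read()
```

Setting m = 1 gives the binary case. Seizing several units at once, as in PChunk/VChunk, does not fit this pattern, since a sequence of unit increments is not atomic as a group.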

4.  Simulation  of  F&A  with  F&I  using  Software  Combining 

F&A can be simulated by F&I in software. Note that there are semantic as well
as performance considerations to address. For example, F&A(V,2) is not
equivalent to

    F&I(V); F&I(V)

since the latter permits another process to see the result of just one
increment. Atomicity can be achieved by encapsulating non-atomic instruction
sequences in an exclusive-access critical section; however, this solution,
although correct, leads to serial bottlenecks. These bottlenecks can be
reduced to time logarithmic in the number of processors by using a technique
known as software combining [YTL86]. The version of software combining
presented in [GSS89] uses only F&I.
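The critical-section approach (the correct but serializing one, not software combining) might be sketched in Python as follows. The assumptions are that F&I/F&D are simulated by the hypothetical `SharedInt` helper, that the binary lock follows the increment-test-undo pattern of Figure 1, and that every access to the target goes through `fetch_and_add`.

```python
import threading
import time


class SharedInt:
    """Counter with indivisible F&I / F&D, simulated with a lock (assumed helper)."""

    def __init__(self):
        self._value = 0
        self._guard = threading.Lock()

    def _add(self, delta):
        with self._guard:
            old = self._value
            self._value = old + delta
            return old

    def fetch_and_increment(self):
        return self._add(1)

    def fetch_and_decrement(self):
        return self._add(-1)

    def read(self):
        with self._guard:
            return self._value


class LockedFetchAndAdd:
    """F&A on a plain integer, serialized by a binary lock built from F&I/F&D."""

    def __init__(self, value=0):
        self._target = value       # plain cell, touched only inside the lock
        self._busy = SharedInt()   # 0 when free, positive while held or contended

    def _acquire(self):
        while True:
            while self._busy.read() > 0:
                time.sleep(0)                      # yield while spinning
            if self._busy.fetch_and_increment() == 0:
                return                             # we were first
            self._busy.fetch_and_decrement()       # lost the race: undo and retry

    def _release(self):
        self._busy.fetch_and_decrement()

    def fetch_and_add(self, e):
        self._acquire()
        old = self._target          # non-atomic read-modify-write, made safe
        self._target = old + e      # by exclusive access
        self._release()
        return old


def run_demo(num_threads=4, calls=25, addend=2):
    """Illustrative harness: returns (sorted old values, final target value)."""
    cell = LockedFetchAndAdd(0)
    seen = []
    seen_guard = threading.Lock()

    def worker():
        for _ in range(calls):
            old = cell.fetch_and_add(addend)
            with seen_guard:
                seen.append(old)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(seen), cell.fetch_and_add(0)
```

With addend 2, the returned old values are all even, so no caller ever observes the effect of "just one increment". The price is that concurrent calls are executed one at a time, which is the serial bottleneck that software combining is meant to reduce.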

5.  Multiqueues 

Multiqueues are queues supporting the atomic insertion of items with arbitrary
multiplicity. This operation is written multienqueue(q,i,m) and is
functionally equivalent to m conventional enqueues of item i into queue q. Two
families of multiqueue algorithms have been developed and are named to reflect
worst-case performance: linear multiqueues and logarithmic multiqueues.

Linear multiqueues utilize a bottleneck-free queue, such as the one suggested
in the introduction, to store the items (i) along with their multiplicities
(m) in structures we refer to as multi-items. Delete operations always access
the head multi-item (and hence form a serial bottleneck if the average
multiplicity is low). The deletes decrement m and, when the head multi-item is
exhausted (i.e., m becomes zero), it is dequeued and the next item becomes the
head. Multi-inserts add multi-items to the tail of the queue.

The linear multiqueue algorithms presented in [GLR83] and [Wil88] maintain a
count of the total number of items (including their multiplicities) in the
queue. This count, which is incremented by m on each multi-insert, is used
only to determine if the multiqueue is empty. However, the linear multiqueue
is empty if and only if the number of multi-items is zero, and this count can
be maintained using only unit addends. The resulting algorithm has comparable
complexity to the existing F&A-based solution.
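The bookkeeping change can be illustrated with a deliberately sequential Python sketch. It shows only which counter is maintained and by how much it changes; it is not a concurrent implementation and is not the code of [GLR83] or [Wil88]. The class and method names are hypothetical.

```python
from collections import deque


class LinearMultiqueue:
    """Sequential illustration: emptiness is tracked by the number of multi-items."""

    def __init__(self):
        self._multi_items = deque()   # each entry is [item, remaining multiplicity]
        self.num_multi_items = 0      # changes only by +1 or -1

    def multienqueue(self, item, m):
        if m <= 0:
            raise ValueError("multiplicity must be positive")
        self._multi_items.append([item, m])
        self.num_multi_items += 1     # unit addend, regardless of m

    def is_empty(self):
        return self.num_multi_items == 0

    def dequeue(self):
        if self.is_empty():
            raise IndexError("dequeue from empty multiqueue")
        head = self._multi_items[0]
        head[1] -= 1                  # unit decrement of the head's multiplicity
        if head[1] == 0:              # head multi-item exhausted
            self._multi_items.popleft()
            self.num_multi_items -= 1 # unit addend
        return head[0]
```

For example, after multienqueue("a", 3) and multienqueue("b", 1) the counter is 2 rather than 4, and it returns to 0 exactly when the fourth dequeue removes the last multi-item.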

Concurrent multi-inserts are performed in constant time, but the time required
for concurrent deletes is proportional to the number of multi-items that need
to be accessed. Hence the parallelism possible is the average multiplicity. If
the multiplicities are low, the simple parallel queues discussed in the
introduction are appropriate; if they are high, linear multiqueues work well.

Logarithmic multiqueues (see [GLR83], [Wil88]) have complexity log(queuesize)
for all operations, making them the recommended structure if the item
multiplicities of inserts are unknown or may vary greatly. One implementation
([Dim8,'>], see also [Wood89]) uses only one F&A with non-unit addend.
Simulating this operation with F&I, as described in section 4, adds a
complexity term logarithmic in the number of processors. For most
applications, this term is dominated by the basic log(queuesize) cost of
logarithmic multiqueues and hence the simulation is free asymptotically.

6.    Group  Lock 

Group lock, introduced in [Dim86], is a process coordination mechanism that
dynamically groups processes attempting to execute a protected section of
code. Once a group is formed, no additional processes can enter the protected
code section until all processes in the current group have exited it. An F&I
implementation of group lock has been developed and will be described in a
subsequent paper.


Ultracomputer Note #1.^9


7.  Conclusion 

We have demonstrated that several important fetch-and-add based coordination
algorithms can be implemented as efficiently using just fetch-and-increment
and fetch-and-decrement. For some algorithms, the result was trivial; only
unit addends were used in the F&A version. For linear multiqueues, we observed
that the only non-unit addend can be eliminated. Our main contribution was a
new algorithm for readers and writers. Finally, we have shown that in general
F&A can be simulated in logarithmic time using F&I. Designers of new computers
with F&A hardware should weigh the relative simplicity of F&I against the
resulting algorithmic tradeoffs.